A message is in-transit with respect to a global state if its sending is recorded in that global state but its receipt is not. Checkpointing algorithms must log such in-transit messages so that the state of the channels can be restored when a computation is resumed from a consistent global state after a failure. Coordinated checkpointing algorithms log exactly those in-transit messages on stable storage. Because they lack synchronization, uncoordinated checkpointing algorithms conservatively log more messages. This paper presents an uncoordinated checkpointing protocol that logs all in-transit messages and the smallest possible number of non-in-transit messages. As a consequence, the protocol saves stable storage space and enables quicker recoveries. Appropriate tracking of causal dependencies among messages constitutes the core of the protocol.
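The definition of an in-transit message can be illustrated with a minimal sketch. It is not the protocol presented in the paper; it only shows how the in-transit set follows from a recorded global state. The names `Message`, `in_transit`, `sent_recorded`, and `received_recorded` are hypothetical, and the global state is assumed to be given as two sets of message identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical message record; field names are illustrative only."""
    msg_id: int
    sender: str
    receiver: str


def in_transit(messages, sent_recorded, received_recorded):
    """Return the messages whose send is recorded in the global state
    but whose receipt is not (the definition given in the abstract)."""
    return [
        m for m in messages
        if m.msg_id in sent_recorded and m.msg_id not in received_recorded
    ]


# Assumed example: three messages exchanged among processes P1, P2, P3.
messages = [
    Message(1, "P1", "P2"),
    Message(2, "P2", "P3"),
    Message(3, "P3", "P1"),
]

# The global state records the sends of messages 1 and 2,
# and the receipt of message 1 only.
sent_recorded = {1, 2}
received_recorded = {1}

pending = in_transit(messages, sent_recorded, received_recorded)
print([m.msg_id for m in pending])  # prints [2]
```

Message 2 is in-transit because its send is recorded and its receipt is not. Message 3 is not in-transit because its send lies outside the recorded global state. On recovery, only messages such as message 2 would need to be replayed from the log to restore the channel state.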